Abstract

This paper presents the current state of our work 
on an interactive toplevel for the OCaml language 
based on the optimizing native code compiler and 
runtime. Our native toplevel is up to 100 times 
faster than the default OCaml toplevel, which is 
based on the byte code compiler and interpreter. 
It uses Just-In-Time techniques to compile toplevel 
phrases to native code at runtime, and currently 
works with various Unix-like systems running on 
x86 or x86-64 processors. 



1 Introduction 

The OCaml [17, 32] system is the main implementation
of the Caml language [S], featuring a powerful module
system combined with a full-fledged object-oriented
layer. It ships with an optimizing native code compiler
ocamlopt, for high performance; a byte code compiler
ocamlc and interpreter ocamlrun, for increased
portability; and an interactive toplevel ocaml based on
the byte code compiler and runtime, for interactive use
of OCaml through a read-eval-print loop.

ocamlc and ocaml translate the source code into a
sequence of byte code instructions for the OCaml virtual
machine ocamlrun, which is based on the ZINC machine [TB]
originally developed for Caml Light [18]. The optimizing
native code compiler ocamlopt produces fast machine code
for the supported targets (at the time of this writing,
these are Alpha, ARM, Itanium, Motorola 68k, MIPS,
PA-RISC, PowerPC, Sparc, and x86/x86-64), but is
currently only applicable to static program compilation.
For example, it cannot yet be used with multi-stage
programming in MetaOCaml [5^ 155] or the interactive
toplevel ocaml.

This paper presents our work¹ on a new native
OCaml toplevel, called ocamlnat, which is based 
on the native runtime, the compilation engine of 
the optimizing native code compiler and an earlier 
prototype implementation of a native toplevel by 
Alain Frisch. Our implementation currently sup- 
ports x86 and x86-64 processors [TJ [TD] and should 
work with any POSIX compliant operating system 
supported by the OCaml native code compiler. It 
is verified to work with Mac OS X 10.6 and 10.7, 
Debian GNU/Linux 6.0 and above, and CentOS 5.6 
and 5.7. The full source code is available from the 
ocaml jit-nat branch of the ocaml-experimental 
Git repository hosted on GitHub at [25] . 

The paper is organized as follows: Section 2 motivates
the need for a usable native OCaml toplevel. Section 3
presents an overview of the OCaml compilers and
Section 4 describes the previous ocamlnat prototype
which inspired our work, while Section 5 presents our
work on ocamlnat. Performance measures are given in
Section 6. Sections 7 and 8 conclude with possible
directions for further work.



2 Motivation 

Interactive toplevels are quite popular among dynamic
and scripting languages like Perl, Python, Ruby and
Shell, but also with functional programming languages
like OCaml, Haskell and LISP. In the case of scripting
languages the interactive toplevel is usually the only
frontend to the underlying interpreter or Just-In-Time
compiler.

¹ The initial work in this area was done as part of the
first author's diploma thesis.

In the case of OCaml, the interactive toplevel is only
one possible interface to the byte code interpreter; 
it is also possible to separately compile source files 
to byte or native code object files, link them into 
libraries or executables, and deploy these libraries 
or executables. The OCaml toplevel is therefore 
mostly used for interactive development, rapid pro- 
totyping, teaching and learning, as an interactive 
program console or for scripting purposes. 

The byte code runtime is the obvious candidate to drive
the interactive toplevel, because the platform
independent byte code is very portable and, compared to
native machine code, easy to generate. In fact, the byte
code toplevel has served users and developers well over
the last years. Nevertheless, there are valid reasons to
have a native code toplevel instead of or in addition to
the byte code toplevel:

Performance. This is probably the main reason why one
wants to have a native code toplevel. While the
performance of the byte code interpreter is acceptable
in many cases (and can be improved by using one of the
available Just-In-Time compilers [23 [Ml Uni 133]), it
is not always sufficient to handle the necessary
computations. Sometimes one needs the execution speed of
the native runtime, which can be up to a hundred times
faster than the byte code runtime, as we will show in
Section 6.

For example, the Mancoosi project [24] has developed a
library for analysing large sets of packages in free
software distributions. These analyses run acceptably
fast with the native code compiler and runtime, but are
too slow in byte code. For interactive analysis (e.g.
selecting packages with particular properties and
analysing them), a native toplevel is really the only
way to go, as it combines the flexibility of toplevel
interaction with the speed of native code.

Tools such as ocamlscript [25] try to combine the
performance of the code generated by the optimizing
native code compiler with the flexibility of a
"scripting language interface". But this is basically
just a work-around with several limitations. A native
toplevel would address this issue in a much cleaner and
simpler way.

Native runtime. There are scenarios where only the
native code runtime is available and hence the byte code
toplevel, which depends on the byte code runtime, cannot
be used. One recent example is the Mirage cloud
operating system [21, 22, 23], which compiles OCaml
programs to Xen micro-kernels 14j and executes them via
the Xen hypervisor [35]. Mirage uses the OCaml toplevel
as OS console, but is currently limited to the byte code
toplevel in read-only mode, due to the lack of a
toplevel that works with the native code runtime.

3 Overview of the OCaml compilers

In this section we briefly describe the OCaml compilers,
covering both the byte code compiler ocamlc and the
optimizing native code compiler ocamlopt. Feel free to
skip to Section 4 if you are already familiar with the
details.

Figure 1: The OCaml compilers

Figure 1 gives an overview of the compiler phases and
representations in the OCaml byte and native code
compilers. Compilation always starts by parsing an OCaml
source program (either from a source file or a source
region in interactive mode) into an abstract syntax tree
(AST, see file parsing/parsetree.mli of the OCaml source
code). Compilation then proceeds by computing the type
annotations to produce a typed syntax tree (see file
typing/typedtree.mli).

From this typed syntax tree, the compiler generates a
so-called lambda representation (see file
bytecomp/lambda.mli) inspired by the untyped
call-by-value λ-calculus [31 [TTJ [3D]. This lambda
representation is then optimized by transforming lambda
trees into better or smaller lambda trees (see file
bytecomp/simplif.ml), yielding a final platform
independent, internal representation of the source
program as result of the compiler frontend phases.
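To make the idea of rewriting lambda trees into better or
smaller ones concrete, here is a minimal sketch on a toy
lambda representation. The type and the two rewrites
(constant folding and inlining of let-bound constants) are
assumptions chosen for illustration; they are not the
definitions from bytecomp/lambda.mli or the passes
implemented in bytecomp/simplif.ml.

```ocaml
(* A toy untyped lambda representation. All names are hypothetical;
   the real representation is far richer. *)
type lambda =
  | Const of int
  | Var of string
  | Add of lambda * lambda
  | Let of string * lambda * lambda

(* Substitute the closed term [v] for free occurrences of [x]. *)
let rec subst x v = function
  | Const _ as c -> c
  | Var y -> if y = x then v else Var y
  | Add (a, b) -> Add (subst x v a, subst x v b)
  | Let (y, e, body) ->
      let e' = subst x v e in
      if y = x then Let (y, e', body)       (* [x] is shadowed in [body] *)
      else Let (y, e', subst x v body)

(* One simplification pass: fold constant additions and inline
   let-bound constants, yielding a smaller but equivalent tree. *)
let rec simplify = function
  | (Const _ | Var _) as l -> l
  | Add (a, b) ->
      (match simplify a, simplify b with
       | Const m, Const n -> Const (m + n)
       | a', b' -> Add (a', b'))
  | Let (x, e, body) ->
      (match simplify e with
       | Const _ as c -> simplify (subst x c body)
       | e' -> Let (x, e', simplify body))
```

For instance, a tree for `let x = 1 + 2 in x + y` is
rewritten to the tree for `3 + y`. Only constants are
inlined here, so the substitution cannot capture variables.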

The simplified lambda representation is then 
used as input for the respective compiler backend, 
which is either 

• the Bytegen module in case of the byte code 
compiler (see file bytecomp/bytegen.ml), or 

• the Asmgen module in case of the optimizing native
code compiler (see file asmcomp/asmgen.ml).

The byte code backend, which is used by the byte code
compiler ocamlc as well as the byte code toplevel ocaml,
basically transforms the simplified lambda
representation into an equivalent byte code program (see
file bytecomp/instruct.mli), suitable for (a) direct
execution by the byte code interpreter ocamlrun or
(b) just-in-time compilation using either OCAMLJIT [33]
or OCamlJIT2 [261 [23 [29]. This is done by the Emitcode
module (see file bytecomp/emitcode.ml). Additional
details about the byte code compiler and runtime can be
found in [H], [17] and [3^.

The native code backend, which is used by the optimizing
native code compiler ocamlopt as well as the native
toplevel ocamlnat, is shown in Figure 2. It takes the
simplified lambda representation as input and starts by
transforming it into a variant of the lambda
representation (see file asmcomp/clambda.mli) with
explicit closures and explicit direct/indirect function
calls (see file asmcomp/closure.ml). This is then
further processed and transformed into an equivalent
representation in an internal dialect of C-- [HI [T3]
(see files asmcomp/cmm.mli and asmcomp/cmmgen.ml).

Figure 2: Native code generation (Asmgen module)

Afterwards the Instruction selection phase (see file
asmcomp/selection.ml) picks appropriate instructions for
the target machine, transforming the C-- code into a
tree based representation of the machine code (see file
asmcomp/mach.mli). The next step attempts to combine
multiple heap allocations within a basic block into a
single heap allocation (see file asmcomp/comballoc.ml),
prior to allocating and assigning physical registers to
the virtual registers used in the machine code (see
function regalloc in file asmcomp/asmgen.ml). The final
phases linearize the machine code (see file
asmcomp/linearize.ml) and perform instruction scheduling
for better performance (see file asmcomp/scheduling.ml),
yielding the final representation of the (linearized)
machine code.
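As an illustration of what linearizing a tree based
machine code representation involves, the following
sketch flattens a toy instruction tree with conditionals
into a list of instructions with explicit labels and
branches. The types mach and linear and the layout chosen
for conditionals are assumptions made for this example
only; they do not reproduce asmcomp/mach.mli or
asmcomp/linearize.ml.

```ocaml
(* Toy tree-shaped machine code: every instruction carries its
   continuation; a conditional carries both arms plus the code that
   follows once the arms join again. Hypothetical types. *)
type mach =
  | End
  | Op of string * mach
  | Ifthenelse of string * mach * mach * mach  (* test, then, else, next *)

(* Toy linear code with explicit control flow. *)
type linear =
  | Lop of string
  | Lcondbranch of string * int   (* jump to the label if the test holds *)
  | Lbranch of int
  | Llabel of int

let linearize m =
  let counter = ref 0 in
  let new_label () = incr counter; !counter in
  (* [go m acc] prepends the linear code of [m] to [acc]. *)
  let rec go m acc =
    match m with
    | End -> acc
    | Op (name, next) -> Lop name :: go next acc
    | Ifthenelse (test, ifso, ifnot, next) ->
        let rest = go next acc in
        let lbl_join = new_label () in
        let lbl_so = new_label () in
        (* Layout: test; else-arm; jump to join; then-arm; join; rest *)
        Lcondbranch (test, lbl_so)
        :: go ifnot
             (Lbranch lbl_join :: Llabel lbl_so
              :: go ifso (Llabel lbl_join :: rest))
  in
  go m []
```

The result is a flat instruction list that an emitter can
walk front to back, which is the form later phases of the
native backend operate on.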

The optimizing native code compiler ocamlopt writes the
linearized machine code output of the Asmgen module to
an assembly file in the appropriate format for the
target platform (see file asmcomp/emit.ml), i.e. using
AT&T assembly syntax on Linux and Mac OS X and Intel
assembly syntax on Windows, and invokes the assembler
from the system compiler toolchain, e.g. GNU as on
Linux, to generate an object file. This object file can
then be linked with other OCaml modules and C code into
an executable binary or a dynamic library file.

4 The native toplevel 

In 2007 Alain Frisch added support for the Dynlink 
library to the native code compiler and run- 
time, which was first made available as part of 
OCaml 3.11. This change made it possible to use 
the OCaml native code runtime with dynamically 
loaded plugins, a feature that was previously only 
available with the byte code runtime. Besides var- 
ious other benefits, this also made it possible to 
reuse the existing functionality of the optimizing 
native code compiler within the scope of a native 
toplevel. 

The initial proof-of-concept prototype of a native
toplevel, developed by Alain Frisch and named ocamlnat,
has since been silently shipped with every OCaml source
code release².

Figure 3 gives an overview of the internals of this
ocamlnat prototype. It starts up the OCaml native
runtime and then prompts the user for OCaml phrases to
evaluate (just like the byte code toplevel ocaml does).
Whenever the user enters a phrase, it is compiled to
native code using the modules of the optimizing native
code compiler (utilizing the frontend phases as shown in
Figure 1 and the native backend phases as shown in
Figure 2).

² It must be built explicitly using make ocamlnat after
make world and make opt, and it is only available for
targets that support the native Dynlink library.

Figure 3: ocamlnat prototype

This native code is written to a temporary assembly file
by the Native code emitter, which is also part of
ocamlopt. The assembly file is then passed to the
Toolchain Assembler, e.g. GNU as on Linux, to produce a
temporary object file. This object file is afterwards
turned into a dynamic library file by the Toolchain
Linker, e.g. GNU ld on Linux, and loaded into the native
toplevel process using the Runtime Linker, finally
yielding a memory area with the executable code which is
then executed.
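The external steps of this pipeline can be sketched as the
shell commands that have to run for a single toplevel
phrase. This is only an illustration under assumptions: the
exact commands and flags used by the prototype are not
given here, so a GNU toolchain with `as -o` and
`ld -shared -o` is assumed, and the function names are
hypothetical.

```ocaml
(* Shell commands for one toplevel phrase, assuming a GNU toolchain.
   The file names are supplied by the caller; nothing is executed here. *)
let toolchain_commands ~asm ~obj ~lib =
  let q = Filename.quote in
  [ Printf.sprintf "as -o %s %s" (q obj) (q asm);            (* assemble *)
    Printf.sprintf "ld -shared -o %s %s" (q lib) (q obj) ]   (* link *)

(* Run the commands in order, stopping at the first failure. *)
let run_all cmds = List.for_all (fun cmd -> Sys.command cmd = 0) cmds
```

Each phrase thus costs two process launches and three
temporary files before the runtime linker can load the
result, which is not shown in the sketch.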

While this approach has the immediate benefit of
requiring only a few hundred lines of glue code to turn
the existing modules of the optimizing native code
compiler and the native Dynlink library into a native
toplevel, it also has several obvious drawbacks that
prevent widespread adoption of ocamlnat:

Dependency on the system toolchain. This is the most
important problem of the native toplevel prototype, as
it prevents the prototype from being used in areas that
would really benefit from a native toplevel but do not
have the toolchain programs available. For example, the
Mirage cloud operating system [21, 22, 23] compiles
OCaml programs to Xen micro-kernels, which are then
executed by the Xen hypervisor [35]; Mirage uses the
OCaml toplevel as OS console, but is limited to the byte
code toplevel in read-only mode right now, as there is
obviously no GNU toolchain available in a Xen
micro-kernel.

It is worth noting that the toolchain dependency is also
a problem with the optimizing native code compiler
ocamlopt on certain platforms such as Microsoft Windows,
where it is often a non-trivial task to install the
system toolchain. This is one of the reasons why
companies such as LexiFi provide custom OCaml
distributions with an integrated toolchain.

Latency. While the latency caused by reading and writing
the assembly, object and library files as well as
invoking the external toolchain programs is not
necessarily a show-stopper for an interactive toplevel,
it is nevertheless quite noticeable, especially with
short running programs or programs with many phrases, as
we will see in Section 6.

Temporary library files. On Microsoft Windows it is
impossible to delete a library file that is currently
loaded into a process, which means that the prototype
"leaks" one library file per toplevel phrase.

Unclear maintenance status. Many people don't even know
about ocamlnat, and those who do cannot rely on it. This
is because ocamlnat is not part of an OCaml
installation, even though it ships as part of the source
code distribution, and it is not documented anywhere.

This is not so much a technical argument against the
current approach, but it highlights its status as being
a proof-of-concept with no clear direction from the
user's point of view.

5 Just-In-Time code generation

We aim to improve ocamlnat in a way that avoids the
drawbacks of the earlier prototype and turns the native
toplevel into a viable alternative to the byte code
toplevel. As noted above, the major drawback of Alain's
prototype is the dependency on the system toolchain,
that is, the external assembler and linker programs.

Therefore we had to replace the last four phases of the
ocamlnat prototype (as shown in Figure 3) with something
that does not depend on any external programs but does
the executable code generation just-in-time within the
process of the native toplevel.

Figure 4 shows our current implementation. We replaced
the Native code emitter and Toolchain Assembler phases
from the ocamlnat prototype with a Just-In-Time Emitter
phase, and the Toolchain Linker and Runtime Linker
phases with a Just-In-Time Linker phase. The earlier
phases, which are shared with the optimizing native code
compiler ocamlopt as described in Sections 3 and 4,
remain unchanged.

Figure 4: ocamlnat overview

The Just-In-Time Emitter phase is responsible for
transforming the linearized native code that is
generated by the Native code compiler (as shown in
Figure 2) into object code for the target platform. This
object code is very similar to the object file generated
by the Toolchain Assembler; it contains a text section
with the executable code, a data section with the
associated data items (e.g. the floating-point
constants, closures and string literals used within the
code, the frametable for the garbage collector, ...), a
list of relocations, and a list of global symbols.
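The four parts of such in-memory object code can be
pictured with a small data structure and a two-instruction
emitter. This is a sketch under assumptions: the record
layout, the relocation kinds and all function names are
hypothetical and are not taken from toplevel/jitaux.ml.
Only the two x86 opcodes used (0xE8 for a relative call,
0xC3 for ret) are standard.

```ocaml
(* Hypothetical in-memory object code with the four parts named above. *)
type section = Text | Data

type reloc_kind =
  | Rel32   (* 32-bit displacement relative to the end of the field *)
  | Abs64   (* 64-bit absolute address *)

type reloc =
  { r_section : section; r_offset : int;
    r_kind : reloc_kind; r_target : string }

type symbol = { s_name : string; s_section : section; s_offset : int }

type objcode = {
  text : Buffer.t;                 (* executable code *)
  data : Buffer.t;                 (* constants, closures, literals *)
  mutable relocs : reloc list;
  mutable symbols : symbol list;
}

let create () =
  { text = Buffer.create 64; data = Buffer.create 64;
    relocs = []; symbols = [] }

(* Define a global symbol at the current end of the text section. *)
let define_function o name =
  o.symbols <-
    { s_name = name; s_section = Text; s_offset = Buffer.length o.text }
    :: o.symbols

(* x86 "call rel32": opcode 0xE8 followed by a 4-byte displacement that
   is left as zero and recorded as a relocation for the linker. *)
let emit_call o target =
  Buffer.add_char o.text '\xE8';
  o.relocs <-
    { r_section = Text; r_offset = Buffer.length o.text;
      r_kind = Rel32; r_target = target }
    :: o.relocs;
  Buffer.add_string o.text "\000\000\000\000"

(* x86 "ret": opcode 0xC3. *)
let emit_ret o = Buffer.add_char o.text '\xC3'
```

Emitting a call followed by a ret yields six bytes of text
and one pending relocation at offset 1, which a later
linking step has to resolve.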

The Just-In-Time Linker phase allocates exe- 
cutable memory for the text section and writable 
memory for the data section, copies the section 
contents to their final memory locations, takes care 
of the relocations, and registers the global sym- 
bols. This is roughly what the Toolchain Linker 
and Runtime Linker in the ocamlnat prototype do. 
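The relocation step can be made concrete with a short
sketch. It works on an ordinary byte buffer that stands in
for executable memory, since allocating real executable
memory needs runtime support written in C. The types, the
two relocation kinds and the function names are
assumptions for illustration, not the code from
toplevel/jitaux.ml.

```ocaml
type reloc_kind = Rel32 | Abs64
type reloc = { offset : int; kind : reloc_kind; target : string }

(* Patch [code], which will live at address [base], so that every
   relocation refers to the address of its target symbol. *)
let apply_relocs ~base ~lookup code relocs =
  List.iter
    (fun r ->
      let addr = lookup r.target in
      match r.kind with
      | Abs64 -> Bytes.set_int64_le code r.offset (Int64.of_int addr)
      | Rel32 ->
          (* the displacement counts from the first byte after the field *)
          let disp = addr - (base + r.offset + 4) in
          Bytes.set_int32_le code r.offset (Int32.of_int disp))
    relocs

(* "Link" a text section: copy it to its final location (here a fresh
   byte buffer), register the symbols it defines in a global table,
   and resolve its relocations against that table. *)
let link ~base ~globals ~defines text relocs =
  let code = Bytes.of_string text in
  List.iter
    (fun (name, off) -> Hashtbl.replace globals name (base + off))
    defines;
  apply_relocs ~base ~lookup:(Hashtbl.find globals) code relocs;
  code
```

For a call at offset 0 of code placed at 0x1000 whose
target lives at 0x2000, the patched displacement is
0x2000 - (0x1000 + 1 + 4) = 0xFFB. Symbols are registered
before relocating in this sketch so that code may refer to
its own definitions; an unknown symbol raises Not_found.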

The code for the two phases is found in the
toplevel/jitaux.ml file, which provides the common,
platform independent functionality for Just-In-Time code
generation, as well as toplevel/amd64/jit.ml for the
x86-64 platform and toplevel/i386/jit.ml for the x86
platform, plus a few lines of additional C code in
asmrun/natdynlink.c and asmrun/natjit.c. At the time of
this writing the changes for our new native toplevel
with support for x86 and x86-64 account for
approximately 2300 lines of C and OCaml code as shown in
Table 1.

Table 1: Additional lines of code for ocamlnat

We tried to keep the code as easy to maintain as 
possible. To achieve this goal we 

(a) reused as much of the existing functionality as 
possible of both the native code compiler and 
its runtime, 

(b) kept the amount of additional runtime support 
as low as possible (basically just an additional 
layer in the global symbol management and a 
new entry point for the Just-In-Time code ex- 
ecution), and 

(c) made the Just-In-Time Emitter (the jit.ml files in
the toplevel subdirectories) look as similar as possible
to the Native Code Emitter (the emit.mlp files in the
asmcomp subdirectories).

The last point is especially important as every change
to the Native Code Emitters (e.g.
asmcomp/amd64/emit.mlp) must be reflected by an
equivalent change to the appropriate Just-In-Time
Emitter (e.g. toplevel/amd64/jit.ml). Fortunately the
Native Code Emitters usually do not change very often
during the OCaml development process.

6 Performance 

We compared the performance of our native toplevel to
the performance of the byte code toplevel ocaml running
on top of the OCaml 3.12.1 byte code interpreter, the
byte code toplevel ocaml running on top of the OCamlJIT2
Just-In-Time byte code compiler [^niHTlUS], and Alain
Frisch's earlier ocamlnat proof-of-concept
implementation (as described in Section 4). We measured
the performance on four different systems:

• A MacBook Pro 13" (Early 2011) with an Intel Core i7
2.7GHz CPU (4 MiB L3 Cache, 256 KiB L2 Cache per Core,
2 Cores) and 4 GiB RAM, running Mac OS X Lion 10.7.1.
The C compiler is llvm-gcc-4.2.1 (Based on Apple Inc.
build 5658) (LLVM build 2336.1.00).

• An iMac 20" (Early 2008) with an Intel Core 2 Duo
"Penryn" 2.66GHz CPU (6 MiB L2 Cache, 2 Cores), and
4 GiB RAM, running Mac OS X Lion 10.7.1. The C compiler
is llvm-gcc-4.2.1 (Based on Apple Inc. build 5658)
(LLVM build 2336.1.00).

• A Fujitsu Siemens Primergy server with two Intel Xeon
E5520 2.26GHz CPUs (8 MiB L2 Cache, 4 Cores), and
12 GiB RAM, running CentOS release 5.7 (Final) with
Linux/x86_64 2.6.18-274.3.1.el5. The C compiler is
gcc-4.1.2 (Red Hat 4.1.2-51).

• A Fujitsu Siemens Primergy server with an Intel
Pentium 4 "Northwood" 2.4 GHz CPU (512 KiB L2 Cache),
and 768 MiB RAM, running Debian testing as of 2011/09
with Linux/i686 3.0.0-1-686-pae. The C compiler is
gcc-4.6.1 (Debian 4.6.1-4).

The OCaml distribution used for the tests is 3.12.1. The
OCamlJIT2 version is the commit 8514ccb from the
ocamljit2 Git repository hosted on GitHub at [29]. For
our native toplevel ocamlnat we used the commit d30210d
from the ocaml-experimental Git repository hosted on
GitHub at [H].

The benchmark programs used to measure the 
performance are the following test programs from 
the testsuite/test folder of the OCaml 3.12.1 
distribution: 

almabench is a number-crunching benchmark de- 
signed for cross-language comparisons. 

bdd is an implementation of binary decision dia- 
grams, and therefore a good test for the sym- 
bolic computation performance. 

boyer is a term manipulation benchmark.

fft is an implementation of the Fast Fourier
Transformation [S].

nucleic is another floating-point benchmark. 

quicksort is an implementation of the well-known 
Quicksort algorithm [S','^ on arrays and serves 
as a good test for loops. 

sieve is an implementation of the sieve of Er- 
atosthenes, one of a number of prime number 
sieves, for finding all prime numbers up to a 
specified integer. 

soli is a simple solitaire solver, well suited for test- 
ing the performance of non-trivial, short run- 
ning programs. 

sorts is a test bench for various sorting algorithms. 

For our tests we measured the total execution time of
the benchmark process itself and all spawned child
processes (only relevant for the earlier toolchain based
ocamlnat prototype), given as combined system and user
CPU time. The times were collected by executing each
benchmark two times with every toplevel, and using the
timings of the fastest run.
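The arithmetic behind this measurement can be sketched as
follows. The record mirrors the four figures that, for
example, Unix.times in the OCaml Unix library reports
(user and system time of the process and of its terminated
children). How the numbers were actually collected is not
stated here, so the type and function names below are
assumptions for illustration.

```ocaml
(* CPU times of one run, in seconds: user and system time of the
   process itself and of its terminated children. *)
type cpu_times =
  { utime : float; stime : float; cutime : float; cstime : float }

(* Combined system and user CPU time, including child processes. *)
let total t = t.utime +. t.stime +. t.cutime +. t.cstime

(* Time reported for a benchmark: the fastest of the given runs. *)
let best_of runs =
  List.fold_left (fun acc t -> min acc (total t)) infinity runs

(* Speedup of a toplevel relative to a baseline time. *)
let speedup ~baseline ~time = baseline /. time
```

With two runs totalling 1.75 s and 2.0 s, the reported
time is 1.75 s; against a baseline of 7.0 s this
corresponds to a speedup of 4.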

Figure 5 presents the results of the benchmarks as
speedup relative to the regular byte code toplevel
ocaml, where OCamlNat/Toolchain is Alain Frisch's
toolchain based ocamlnat prototype and OCamlNat/JIT is
our new Just-In-Time based native toplevel. As you can
see, we managed to achieve speedups of up to a hundred
times over the byte code toplevel in certain benchmarks.
It is however worth noting that this is in part related
to the fact that llvm-gcc became the default C compiler
with recent versions of OS X (and its related software
development tools), which disables the very important
manual register assignment optimization in the byte code
interpreter, because LLVM does not support manual
register assignment.

7 Related and further work 

We are looking forward to integrating our new native
toplevel as interactive console into the Mirage cloud
operating system [21, 22, 23], which will be really
useful for development (e.g. when exploring the heap of
Xen kernels interactively to find resource leaks).

Right now our ocamlnat implementation supports only x86
and x86-64 targets on POSIX systems. We plan to add
support for additional targets that are already
supported by the optimizing native code compiler, most
notably ARM and PowerPC, and to extend the support to
other operating systems, such as Windows.

We are also working on integrating the linear scan
register allocator [3T1 [37] into the optimizing native
code compiler as an alternative to the currently used
graph coloring register allocator [2]. By using the
linear scan algorithm, which is commonly used in the
scope of Just-In-Time compilers, we expect to avoid the
quadratic worst case running time of the graph coloring
algorithm³.

We are also in the process of evaluating the use of the
LLVM compiler infrastructure [TH [TS] [IH] as a
replacement for the last phases of the native code
compiler engine. Other projects such as Clang [7],
GHC [3B] and MacRuby [20] already demonstrated the
viability and usefulness of using LLVM as a compiler
backend. Besides the other obvious benefits of using
LLVM in OCaml, we would also get the Just-In-Time
compilation and execution engine for free.



8 Conclusion 

Our results demonstrate that an OCaml toplevel based on
the native code compiler and runtime offers significant
performance improvements over the byte code toplevel (at
least three times as fast in all benchmarks, and up to a
hundred times faster in one benchmark), at acceptable
maintenance costs.

As demonstrated in Section 6, we were also able to beat
the performance of our earlier byte code based OCamlJIT2
prototype in almost every case (except for the
almabench benchmark on x86, which is due to the fact
that OCamlJIT2 uses SSE2 registers and instructions
while our native toplevel uses the x87 FPU stack and
instructions [26]).



³ This work is also part of the first author's diploma thesis.